Abstract

We present an approach for automatically learning paraphrases from
aligned monolingual corpora. Our algorithm works by generalizing the
syntactic paths between corresponding anchors in aligned sentence
pairs. Compared to previous work, structural paraphrases generated by
our algorithm tend to be much longer on average, and are capable of
capturing long-distance dependencies. In addition to a standalone
evaluation of our paraphrases, we also describe a question answering
application currently under development that could immensely benefit
from automatically learned structural paraphrases.

1  Introduction 

The richness of human language allows people to express the same idea
in many different ways; they may use different words to refer to the
same entity or employ different phrases to describe the same concept.
Acquisition of paraphrases, or alternative ways to convey the same
information, is critical to many natural language applications. For
example, an effective question answering system must be equipped to
handle these variations, because it should be able to respond to
differently phrased natural language questions.

While there are many resources that help systems deal with single-word
synonyms, e.g., WordNet, there are few resources for multiple-word or
domain-specific paraphrases. Because manually collecting paraphrases is
time-consuming and impractical for large-scale applications, attention
has recently focused on techniques for automatically acquiring
paraphrases.

We present an unsupervised method for acquiring structural paraphrases,
or fragments of syntactic trees that are roughly semantically
equivalent, from aligned monolingual corpora. The structural
paraphrases produced by our algorithm are similar to the S-rules
advocated by Katz and Levin (1988) for question answering, except that
our paraphrases are automatically generated. Because there is
disagreement regarding the exact definition of paraphrases (Dras,
1999), we employ the operating definition that structural paraphrases
are roughly interchangeable within the specific configuration of
syntactic structures that they specify.

Our approach is a synthesis of techniques developed by Barzilay and
McKeown (2001) and Lin and Pantel (2001), designed to overcome the
limitations of both. In addition to the evaluation of paraphrases
generated by our method, we also describe a novel information retrieval
system under development that is designed to take advantage of
structural paraphrases.

2  Previous  Work 

There has been a rich body of research on automatically deriving
paraphrases, including equating morphological and syntactic variants of
technical terms (Jacquemin et al., 1997), and identifying equivalent
adjective-noun phrases (Lapata, 2001). Unfortunately, both are limited
in the types of paraphrases that they can extract. Other researchers
have explored distributional clustering of similar words (Pereira et
al., 1993; Lin, 1998), but it is unclear to what extent such techniques
produce paraphrases.[1]

Most relevant to this paper is the work of Barzilay and McKeown and the
work of Lin and Pantel. Barzilay and McKeown (2001) extracted both
single- and multiple-word paraphrases from a sentence-aligned corpus
for use in multi-document summarization. They constructed an aligned
corpus from multiple translations of foreign novels. From this, they
co-trained a classifier that decided whether or not two phrases were
paraphrases of each other based on their surrounding context. Barzilay
and McKeown collected 9483 paraphrases with an average precision of
85.5%. However, 70.8% of the paraphrases were single words. In
addition, the paraphrases were required to be contiguous.

Lin and Pantel (2001) used a general text corpus to extract what they
called inference rules, which we can take to be paraphrases. In their
algorithm, rules are represented as dependency tree paths between two
words. The words at the ends of a path are considered to be features of
that path. For each path, they recorded the different features (words)
that were associated with the path and their respective frequencies.
Lin and Pantel calculated the similarity of two paths by looking at the
similarity of their features. This method allowed them to extract
inference rules of moderate length from general corpora. However, the
technique is computationally expensive, and furthermore can give
misleading results, since paths with opposite meanings often share
similar features.

3  Approach 

Our approach, like Barzilay and McKeown’s, is built on the application
of sentence-alignment techniques used in machine translation to
generate paraphrases. The insight is simple: if we have pairs of
sentences with the same semantic content, then the difference in
lexical content can be attributed to variations in the surface form. By
generalizing these differences we can automatically derive paraphrases.
Barzilay and McKeown perform this learning process by only considering
the local context of words and their frequencies; as a result,
paraphrases must be contiguous, and in the majority of cases, are only
one word long. We believe that disregarding the rich syntactic
structure of language is an oversimplification, and that structural
paraphrases offer several distinct advantages over lexical paraphrases.
Long-distance relations can be captured by syntactic trees, so that
words in the paraphrases do not need to be contiguous. Use of syntactic
trees also buffers against morphological variants (e.g., different
inflections) and some syntactic variants (e.g., active vs. passive).
Finally, because paraphrases are context-dependent, we believe that
syntactic structures can encapsulate a richer context than lexical
phrases.

[1] For example, “dog” and “cat” are recognized to be similar, but they
are obviously not paraphrases of one another.

Based on aligned monolingual corpora, our technique for extracting
paraphrases builds on Lin and Pantel’s insight of using dependency
paths (derived from parsing) as the fundamental unit of learning and
using parts of those paths as features. Based on the hypothesis that
paths between identical words in aligned sentences are semantically
equivalent, we can extract paraphrases by scoring the path frequency
and context. Our approach addresses the limitations of both Barzilay
and McKeown’s and Lin and Pantel’s work: using syntactic structures
allows us to generate structural paraphrases, and using aligned corpora
renders the process more computationally tractable. The following
sections describe our approach in greater detail.

3.1  Corpus  Alignment 

Multiple English translations of foreign novels, e.g., Twenty Thousand
Leagues Under the Sea by Jules Verne, were used for extraction of
paraphrases. Although translations by different authors differ slightly
in their literary interpretation of the original text, it was usually
possible to find corresponding sentences that have the same semantic
content. Sentence alignment was performed using the Gale and Church
algorithm (1991) with the following cost function:

\[ \text{cost of substitution} = 1 - \frac{ncw}{anw} \]

ncw: number of common words

anw: average number of words in the two strings

Here  is  a  sample  from  two  different  translations 
of  Twenty  Thousand  Leagues  Under  the  Sea: 


Ned  Land  tried  the  soil  with  his  feet,  as  if  to  take 
possession  of  it. 

Ned  Land  tested  the  soil  with  his  foot,  as  if  he  were 
laying  claim  to  it. 
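The cost function above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in our experiments: the tokenization (lowercasing, splitting on non-alphanumeric characters) and the treatment of repeated words as a multiset intersection are assumptions, and the names `tokenize` and `substitution_cost` are hypothetical.

```python
import re
from collections import Counter


def tokenize(sentence: str) -> list:
    """Lowercase and split into word tokens (an assumed tokenization)."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def substitution_cost(sent_a: str, sent_b: str) -> float:
    """Return 1 - ncw / anw for a candidate sentence pair."""
    words_a, words_b = tokenize(sent_a), tokenize(sent_b)
    # anw: average number of words in the two strings.
    anw = (len(words_a) + len(words_b)) / 2
    if anw == 0:
        return 0.0  # two empty strings are treated as identical
    # ncw: number of common words, counted as a multiset intersection
    # (an assumption; the definition above does not say how repeats count).
    ncw = sum((Counter(words_a) & Counter(words_b)).values())
    return 1.0 - ncw / anw


pair = (
    "Ned Land tried the soil with his feet, as if to take possession of it.",
    "Ned Land tested the soil with his foot, as if he were laying claim to it.",
)
print(round(substitution_cost(*pair), 3))
```

Under these assumptions the sample pair shares ten of its words, so its cost is well below that of two unrelated sentences, which is what lets the aligner prefer it.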

To test the accuracy of our alignment, we manually aligned 454
sentences from two different versions of Chapter 21 from Twenty
Thousand Leagues Under the Sea and compared the results of our
automatic alignment algorithm against the manually generated “gold
standard.” We obtained a precision of 0.93 and recall of 0.88, which is
comparable to the numbers (P.94/R.85) reported by Barzilay and McKeown,
who used a different cost function for the alignment process.
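For concreteness, precision and recall against a gold standard can be computed as below. This is a sketch under the assumption that an alignment is represented as a set of (sentence index, sentence index) pairs; the function name and the toy data are hypothetical and are not our evaluation data.

```python
from typing import Set, Tuple

Alignment = Set[Tuple[int, int]]  # (index in translation A, index in translation B)


def precision_recall(predicted: Alignment, gold: Alignment) -> Tuple[float, float]:
    """Precision and recall of predicted sentence pairs against gold pairs."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data: three predicted pairs, two of which are in a four-pair gold standard.
toy_predicted = {(0, 0), (1, 1), (2, 3)}
toy_gold = {(0, 0), (1, 1), (2, 2), (3, 3)}
print(precision_recall(toy_predicted, toy_gold))
```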

3.2  Parsing  and  Postprocessing 

The sentence pairs produced by the alignment algorithm are then parsed
by the Link Parser (Sleator and Temperley, 1993), a dependency-based
parser developed at CMU. The resulting parse structures are
post-processed to render the links more consistent. Because the Link
Parser does not directly identify the subject of a passive sentence,
our postprocessor takes the object of the by-phrase as the subject by
default. For our purposes, auxiliary verbs are ignored; the
postprocessor connects verbs directly to their subjects, discarding
links through any auxiliary verbs. In addition, subjects and objects
within relative clauses are appropriately modified so that the linkages
remain consistent with subject and object linkages in the matrix
clause. For sentences involving verbs that have particles, the Link
Parser connects the object of the verb directly to the verb itself,
attaching the particle separately. Our postprocessor modifies the link
structure so that the object is connected to the particle in order to
form a continuous path. Predicate adjectives are converted into an
adjective-noun modification link instead of a complete verb-argument
structure. Also, common nouns denoting places and people are marked by
consulting WordNet.

3.3  Paraphrase  Extraction 

The paraphrase extraction process starts by finding anchors within the
aligned sentence pairs. In our approach, only nouns and pronouns serve
as possible anchors. The anchor words from the sentence pairs are
brought into alignment and scored by a simple set of ordered
heuristics:

•  Exact string matches denote correspondence.

•  Noun and matching pronoun (same gender and number) denote
correspondence. Such a match penalizes the score by 50%.

•  Unique semantic class (e.g., places and people) denotes
correspondence. Such a match penalizes the score by 50%.

•  Unique part of speech (i.e., the only noun pair in the sentences)
denotes correspondence. Such a match penalizes the score by 50%.

•  Otherwise, attempt to find correspondence by finding longest common
substrings. Such a match penalizes the score by 50%.

•  If a word occurs more than once in the aligned sentence pairs, all
possible combinations are considered, but the score for such a
corresponding anchor pair is further penalized by 50%.
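The ordered heuristics above might be sketched as follows. This is an illustrative reading, not our implementation: the `Word` record, the interpretation of "penalizes the score by 50%" as multiplication by 0.5, and the `min_lcs` threshold for the longest-common-substring fallback are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Word:
    text: str
    pos: str                         # "noun" or "pronoun"
    gender: Optional[str] = None     # e.g. "m", "f"
    number: Optional[str] = None     # "sg" or "pl"
    sem_class: Optional[str] = None  # e.g. "person", "place"


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common substring of a and b."""
    best, prev = 0, [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def only_one(words: List[Word], pred: Callable[[Word], bool]) -> bool:
    return sum(1 for w in words if pred(w)) == 1


def anchor_score(a: Word, b: Word, sent_a: List[Word], sent_b: List[Word],
                 min_lcs: int = 4) -> float:
    """Score a candidate anchor pair; 0.0 means no correspondence found."""
    ta, tb = a.text.lower(), b.text.lower()
    if ta == tb:                                          # exact string match
        score = 1.0
    elif ({a.pos, b.pos} == {"noun", "pronoun"}
          and a.gender is not None and a.gender == b.gender
          and a.number is not None and a.number == b.number):
        score = 0.5                                       # noun / matching pronoun
    elif (a.sem_class is not None and a.sem_class == b.sem_class
          and only_one(sent_a, lambda w: w.sem_class == a.sem_class)
          and only_one(sent_b, lambda w: w.sem_class == b.sem_class)):
        score = 0.5                                       # unique semantic class
    elif (a.pos == b.pos
          and only_one(sent_a, lambda w: w.pos == a.pos)
          and only_one(sent_b, lambda w: w.pos == b.pos)):
        score = 0.5                                       # unique part of speech
    elif lcs_length(ta, tb) >= min_lcs:                   # assumed threshold
        score = 0.5
    else:
        return 0.0
    # A word occurring more than once incurs a further 50% penalty.
    repeated = (sum(w.text.lower() == ta for w in sent_a) > 1
                or sum(w.text.lower() == tb for w in sent_b) > 1)
    return score * 0.5 if repeated else score
```

Because the branches are tried in order, a pair that satisfies an earlier heuristic is never rescored by a later one, which mirrors the "ordered" nature of the list.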

For each pair of anchors, a breadth-first search is used to find the
shortest path between the anchor words. The search algorithm explicitly
rejects paths that contain conjunctions and punctuation. If valid paths
are found between anchor pairs in both of the aligned sentences, the
resulting paths are considered candidate paraphrases, with a default
score of one (subject to penalties imposed by imperfect anchor
matching).
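The path search can be illustrated with a small breadth-first search over an undirected link graph. This is a sketch only: the graph is hand-built rather than produced by the Link Parser, words double as node identifiers (a real implementation would need word positions to distinguish repeated words), and the names `build_graph` and `shortest_path` are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Set, Tuple


def build_graph(links: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Turn a list of word-to-word links into an undirected adjacency map."""
    graph: Dict[str, List[str]] = {}
    for a, b in links:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    return graph


def shortest_path(graph: Dict[str, List[str]], start: str, goal: str,
                  rejected: Set[str]) -> Optional[List[str]]:
    """Breadth-first search that never passes through a rejected node
    (here standing in for conjunctions and punctuation)."""
    if start not in graph or goal not in graph:
        return None
    parent: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path: List[str] = []
            cur: Optional[str] = node
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        for nxt in graph[node]:
            if nxt not in parent and nxt not in rejected:
                parent[nxt] = node
                queue.append(nxt)
    return None


# Hand-built links for "he rushed over to his son".
links = [("he", "rushed"), ("rushed", "over"), ("over", "to"),
         ("to", "son"), ("his", "son")]
print(shortest_path(build_graph(links), "he", "son", rejected={"and", ","}))
```

Running the same search on the aligned sentence and pairing the two resulting paths yields one candidate paraphrase.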

Scores of candidate paraphrases take into account two factors: the
frequency of anchors with respect to a particular candidate paraphrase
and the variety of different anchors from which the paraphrase was
produced. The initial default score of any paraphrase is one (assuming
perfect anchor matches), but for each additional occurrence the score
is incremented by $(1/2)^n$, where n is the number of times the current
set of anchors has been seen. Therefore, seeing a new set of anchors
has a big initial impact on the score, but the additional increase in
score is subject to diminishing returns as more occurrences of the same
anchors are encountered.
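The scoring scheme can be sketched as a running tally. Several points are assumptions made for illustration: the increment is read as $(1/2)^n$, n is taken to include the current sighting of the anchor set, and the anchor-matching penalty is applied to increments as well as to the initial score. The class name `ParaphraseScorer` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Tuple


class ParaphraseScorer:
    """Accumulate scores for candidate paraphrases across a corpus."""

    def __init__(self) -> None:
        self.scores: Dict[Hashable, float] = {}
        # (paraphrase, anchors) -> number of times this combination was seen
        self.seen: Dict[Tuple[Hashable, Hashable], int] = defaultdict(int)

    def observe(self, paraphrase: Hashable, anchors: Hashable,
                anchor_score: float = 1.0) -> float:
        self.seen[(paraphrase, anchors)] += 1
        n = self.seen[(paraphrase, anchors)]
        if paraphrase not in self.scores:
            # First occurrence: default score of one, scaled by anchor quality.
            self.scores[paraphrase] = anchor_score
        else:
            # Later occurrences add (1/2)^n: a new anchor set (n = 1) adds 0.5,
            # a repeat of the same anchors (n = 2) adds only 0.25, and so on.
            self.scores[paraphrase] += anchor_score * 0.5 ** n
        return self.scores[paraphrase]


scorer = ParaphraseScorer()
rule = ("A1 rush over to A2", "A1 run to A2")
print(scorer.observe(rule, ("he", "son")))
print(scorer.observe(rule, ("she", "door")))
print(scorer.observe(rule, ("he", "son")))
```

Under this reading, variety of anchors is rewarded more than repetition of the same anchors, matching the diminishing-returns behaviour described above.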


                                       count
aligned sentences                      27479
parsed aligned sentences               25292
anchor pairs                           43974
paraphrases                             5925
unique paraphrases                      5502
gathered paraphrases (score > 1.0)      2886

Table 1: Summary of the paraphrase generation process


Figure 1: Distribution of paraphrase length


4  Results 

Using the approach described in previous sections, we were able to
extract nearly six thousand different paraphrases (see Table 1) from
our corpus, which consisted of two translations of 20,000 Leagues Under
the Sea, two translations of The Kreutzer Sonata, and three
translations of Madame Bovary.

Our corpus was essentially the same as the one used by Barzilay and
McKeown, with the exception of some short fairy tale translations that
we found to be unsuitable. Due to the length of sentences (some
translations were noted for their paragraph-length sentences), the Link
Parser was unable to produce a parse for approximately eight percent of
the sentences. Although the Link Parser is capable of producing partial
linkages, accuracy deteriorated significantly as the length of the
input string increased. The distribution of paraphrase length is shown
in Figure 1. The length of a paraphrase is measured by the number of
words that it contains (discounting the anchors on both sides).

To evaluate the accuracy of our results, 130 unique paraphrases were
randomly chosen to be assessed by human judges. The human assessors
were specifically asked whether they thought the paraphrases were
roughly interchangeable with each other, given the context of the
genre. We believe that the genre constraint was important because some
paraphrases captured literary or archaic uses of particular words that
were not generally useful. This should not be viewed as a shortcoming
of our approach, but rather an artifact of our corpus. In addition,
sample sentences containing the structural paraphrases were presented
as context to the judges; structural paraphrases are difficult to
comprehend without this information.

Evaluator      Precision
Evaluator 1    36.2%
Evaluator 2    40.0%
Evaluator 3    44.6%
Average        40.2%

Table 2: Summary of judgments by human evaluators for 130 unique
paraphrases

A summary of the judgments provided by human evaluators is shown in
Table 2. The average precision of our approach stands at just over
forty percent; the average length of the paraphrases learned was 3.26
words. Our results also show that judging structural paraphrases is a
difficult task and inter-assessor agreement is rather low. All of the
evaluators agreed on the judgments (either positive or negative) only
75.4% of the time. The average correlation constant of the judgments is
only 0.66.
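The two summary figures reported here, per-evaluator precision and the rate of unanimous agreement, can be computed from a table of binary judgments as sketched below. The judgment matrix is toy data invented for the example, not our evaluation data, and the function names are hypothetical.

```python
from typing import List

# judgments[i][j] is True if evaluator j accepted paraphrase i.
Judgments = List[List[bool]]


def evaluator_precisions(judgments: Judgments) -> List[float]:
    """Fraction of paraphrases each evaluator judged to be valid."""
    n_items = len(judgments)
    n_evaluators = len(judgments[0])
    return [sum(row[j] for row in judgments) / n_items
            for j in range(n_evaluators)]


def unanimous_agreement(judgments: Judgments) -> float:
    """Fraction of paraphrases on which all evaluators gave the same verdict,
    whether positive or negative."""
    return sum(len(set(row)) == 1 for row in judgments) / len(judgments)


TOY = [
    [True, True, True],     # all accept
    [False, False, False],  # all reject
    [True, False, True],    # split
    [False, False, True],   # split
]
precisions = evaluator_precisions(TOY)
print(precisions, sum(precisions) / len(precisions), unanimous_agreement(TOY))
```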

The highest scoring paraphrase was the equivalence of the possessive
morpheme ’s with the preposition of. We found it encouraging that our
algorithm was able to induce this structural paraphrase, complete with
co-indexed anchors on the ends of the paths, i.e., A’s B ⇔ B of A. Some
other interesting examples include:[2]

A1 <-> liked -> A2 ⇔
A1 <-> fond -> of -> A2

Example: The clerk liked Monsieur Bovary.
The clerk was fond of Monsieur Bovary.

A1 <-S-> rush -K-> over -MV-> to -J-> A2 ⇔
A1 <-> run -> to -> A2

Example: And he rushed over to his son, who had just jumped into a heap
of lime to whiten his shoes.
And he ran to his son, who had just precipitated himself into a heap of
lime in order to whiten his boots.

A1 <-> put -> on -> A2 ⇔
A1 <-S-> wear -O-> A2

Example: That is why he puts on his best waistcoat and risks spoiling
it in the rain.
That’s why he wears his new waistcoat, even in the rain!

A1 <-S-> fit -MV-> to -I-> give -O-> A2 ⇔
A1 <-> appropriate -> to -> supply -> A2

Example: He thought fit, after the first few mouthfuls, to give some
details as to the catastrophe.
After the first few mouthfuls he considered it appropriate to supply a
few details concerning the catastrophe.

[2] Brief description of link labels: S: subject to verb; O: object to
verb; OF: certain verbs to of; K: verbs to particles; MV: verbs to
certain modifying phrases. See Link Parser documentation for full
descriptions.

Score Threshold    Avg. Precision    Avg. Length    Count
> 1.0              40.2%             3.24           130
> 1.25             46.0%             2.88            58
> 1.5              47.8%             2.22            23
> 1.75             38.9%             1.67            12

Table 3: Breakdown of our evaluation results

A more detailed breakdown of the evaluation results can be seen in
Table 3. Increasing the threshold for generating paraphrases tends to
increase their precision, up to a certain point. In general, the
highest ranking structural paraphrases consisted of single-word
paraphrases of prepositions, e.g., at ⇔ in. Our algorithm noticed that
different prepositions were often interchangeable, which is something
that our human assessors disagreed widely on. Beyond a certain
threshold, the accuracy of our approach actually decreases.

5  Discussion 

An obvious first observation about our algorithm is the dependence on
parse quality; bad parses lead to many bogus paraphrases. Although the
parse results from the Link Parser are far from perfect, it is unclear
whether other purely statistical parsers would fare any better, since
they are generally trained on corpora containing a totally different
genre of text. However, future work will most likely include a
comparison of different parsers.

Examination of our results shows that a better notion of constituency
would increase the accuracy of our results. Our algorithm occasionally
generates nonsensical paraphrases that cross constituent boundaries,
for example, including the verb of a subordinate clause with elements
from the matrix clause. Other problems arise because our current
algorithm has no notion of verb phrases; it often generates near misses
such as fail ⇔ succeed, neglecting to include not as part of the
paraphrase.

However, there are problems inherent in paraphrase generation that
simple knowledge of constituency alone cannot solve. Consider the
following two sentences:

John made out gold at the bottom of the well.

John discovered gold near the bottom of the well.

Which structural paraphrases should we be able to extract?

made out X at Y ⇔ discovered X near Y
made out X ⇔ discovered X
at X ⇔ near X

Arguably, all three paraphrases are valid, although opinions vary more
regarding the last paraphrase. What is the optimal level of structure
for paraphrases? Obviously, this represents a tradeoff between
specificity and accuracy, but the ability of structural paraphrases to
capture long-distance relationships across large numbers of lexical
items complicates the problem. Due to the sparseness of our data, our
algorithm cannot make a good decision on what constituents to
generalize as variables; naturally, greater amounts of data would
alleviate this problem. This current inability to decide on a good
“scope” for paraphrasing was a primary reason why we were unable to
perform a strict evaluation of recall. Our initial attempts at
generating a gold standard for estimating recall failed because human
judges could not agree on the boundaries of paraphrases.

The accuracy of our structural paraphrases is highly dependent on the
corpus size. As can be seen from the numbers in Table 1, paraphrases
are rather sparse: nearly 93% of them are unique. Without adequate
statistical evidence, validating candidate paraphrases can be very
difficult. Although our data sparseness problem can be alleviated
simply by gathering a larger corpus, the type of parallel text our
algorithm requires is rather hard to obtain, i.e., there are only so
many translations of so many foreign novels. Furthermore, since our
paraphrases are arguably genre-specific, different applications may
require different training corpora. Similar to the work of Barzilay and
Lee (2003), who have applied paraphrase generation techniques to
comparable corpora consisting of different newspaper articles about the
same event, we are currently attempting to solve the data sparseness
problem by extending our approach to non-parallel corpora.

We believe that generating paraphrases at the structural level holds
several key advantages over lexical paraphrases, from the capturing of
long-distance relationships to the more accurate modeling of context.
The paraphrases generated by our approach could prove to be useful in
any natural language application where understanding of linguistic
variations is important. In particular, we are attempting to apply our
results to improve the performance of question answering systems, which
we will describe in the following section.

6  Paraphrases  and  Question  Answering 

The ultimate goal of our work on paraphrases is to enable the
development of a high-precision question answering system (cf. Katz and
Levin, 1988; Soubbotin and Soubbotin, 2001; Hermjakob et al., 2002). We
believe that a knowledge base of paraphrases is the key to overcoming
challenges presented by the expressiveness of natural languages.
Because the same semantic content can be expressed in many different
ways, a question answering system must be able to cope with a variety
of alternative phrasings. In particular, an answer stated in a form
that differs from the form of the question presents significant
problems:

When did Colorado become a state?

(1a) Colorado became a state in 1876.

(1b) Colorado was admitted to the Union in 1876.

Who killed Abraham Lincoln?

(2a) John Wilkes Booth killed Abraham Lincoln.

(2b) John Wilkes Booth ended Abraham Lincoln’s life with a bullet.

In the above examples, question answering systems have little
difficulty extracting answers if the answers are stated in a form
directly derived from the question, e.g., (1a) and (2a); simple keyword
matching techniques with primitive named-entity detection technology
will suffice. However, question answering systems will have a much
harder time extracting answers from sentences where they are not
obviously stated, e.g., (1b) and (2b). To relate questions to answers
in those examples, a system would need access to rules like the
following:

X became a state in Y ⇔ X was admitted to the Union in Y

X killed Y ⇔ X ended Y’s life

We  believe  that  such  rules  are  best  formulated  at 
the  syntactic  level:  structural  paraphrases  represent 
a  good  level  of  generality  and  provide  much  more 
accurate  results  than  keyword-based  approaches. 

The simplest approach to overcoming the “paraphrase problem” in
question answering is via keyword query expansion when searching for
candidate answers:

(AND X became state) ⇔ (AND X admitted Union)

(AND X killed) ⇔ (AND X ended life)

The major drawback of such techniques is overgeneration of bogus answer
candidates. For example, it is a well-known result that query expansion
based on synonymy, hyponymy, etc. may actually degrade performance if
done in an uncontrolled manner (Voorhees, 1994). Typically,
keyword-based query expansion techniques sacrifice significant amounts
of precision for little (if any) increase in recall.

The problems associated with keyword query expansion techniques stem
from the fundamental deficiencies of “bag-of-words” approaches; in
short, they simply cannot accurately model the semantic content of
text, as illustrated by the following pairs of sentences and phrases
that have the same word content, but dramatically different meaning:

(3a)  The  bird  ate  the  snake. 

(3b)  The  snake  ate  the  bird. 

(4a)  the  largest  planet’s  volcanoes 

(4b)  the  planet’s  largest  volcanoes 

(5a)  the  house  by  the  river 

(5b)  the  river  by  the  house 

(6a)  The  Germans  defeated  the  French. 

(6b)  The  Germans  were  defeated  by  the  French. 

The above examples are nearly indistinguishable in terms of lexical
content, yet their meanings are vastly different. Naturally, because
one text fragment might be an appropriate answer to a question while
the other fragment may not be, a question answering system seeking to
achieve high precision must provide mechanisms for differentiating the
semantic content of the pairs.
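The point can be checked mechanically with a toy bag-of-words representation. The sketch below is illustrative only; the tokenizer and the small stopword list are assumptions, and pair (6) becomes identical only after such function words are dropped, as keyword-based systems commonly do.

```python
import re
from collections import Counter

# A deliberately tiny, assumed stopword list for the example.
STOPWORDS = {"the", "were", "by"}


def bag_of_words(text: str, drop_stopwords: bool = False) -> Counter:
    """Multiset of lowercased word tokens, optionally without stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if drop_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return Counter(tokens)


PAIRS = [
    ("The bird ate the snake.", "The snake ate the bird."),
    ("the largest planet's volcanoes", "the planet's largest volcanoes"),
    ("the house by the river", "the river by the house"),
]
for left, right in PAIRS:
    # Different meanings, identical bags.
    print(bag_of_words(left) == bag_of_words(right))

active = "The Germans defeated the French."
passive = "The Germans were defeated by the French."
print(bag_of_words(active, True) == bag_of_words(passive, True))
```

Since each pair maps to the same bag, no amount of keyword weighting or expansion applied to that representation can tell the two members apart.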

While paraphrase techniques at the keyword level vastly overgenerate,
paraphrase techniques at the phrase level undergenerate, that is, they
are often too specific. Although paraphrase rules can easily be
formulated at the string level, e.g., using regular expression matching
and substitution techniques (Soubbotin and Soubbotin, 2001; Hermjakob
et al., 2002), such a treatment fails to capture important linguistic
generalizations. For example, the addition of an adverb typically does
not alter the validity of a paraphrase; thus, a phrase-level rule “X
killed Y” ⇔ “X ended Y’s life” would not be able to match an answer
like “John Wilkes Booth suddenly ended Abraham Lincoln’s life with a
bullet”. String-level paraphrases are also unable to handle syntactic
phenomena like passivization, which are easily captured at the
syntactic level.
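The brittleness of string-level rules is easy to reproduce. The sketch below encodes the right-hand side of the rule as a regular expression; the crude `NAME` pattern (a run of capitalized words) and the function name are assumptions made for the example and are not taken from the cited systems.

```python
import re
from typing import Optional, Tuple

# Assumed, deliberately crude pattern for a proper name: capitalized words.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# String-level counterpart of the rule "X killed Y" <=> "X ended Y's life".
ENDED_LIFE = re.compile(rf"(?P<X>{NAME}) ended (?P<Y>{NAME})'s life")


def match_ended_life(sentence: str) -> Optional[Tuple[str, str]]:
    """Return the (X, Y) bindings if the string-level rule matches."""
    m = ENDED_LIFE.search(sentence)
    return (m["X"], m["Y"]) if m else None


plain = "John Wilkes Booth ended Abraham Lincoln's life with a bullet."
adverb = "John Wilkes Booth suddenly ended Abraham Lincoln's life with a bullet."
print(match_ended_life(plain))   # binds X and Y
print(match_ended_life(adverb))  # the inserted adverb blocks the match
```

One could patch the pattern to allow an optional adverb, but each such patch handles a single surface variation, whereas a rule stated over syntactic relations is unaffected by modifiers attached elsewhere in the tree.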

We believe that answering questions at the level of syntactic
relations, that is, matching parsed representations of questions with
parsed representations of candidates, addresses the issues presented
above. Syntactic relations, basically simplified versions of dependency
structures derived from the Link Parser, can capture significant
portions of the meaning present in text documents, while providing a
flexible foundation on which to build machinery for paraphrases.

Our position is that question answering should be performed at the
level of “key relations” in addition to keywords. We have begun to
experiment with relations indexing and matching techniques described
above using an electronic encyclopedia as the test corpus. We
identified a particular set of linguistic phenomena where
relation-based indexing can dramatically boost the precision of a
question answering system (Katz and Lin, 2003). As an example, consider
a sample output from a baseline keyword-based IR system:

What  do  frogs  eat? 

(R1) Alligators eat many kinds of small animals that live in or near
the water, including fish, snakes, frogs, turtles, small mammals, and
birds.

(R2) Some bats catch fish with their claws, and a few species eat
lizards, rodents, birds, and frogs.

(R3) Bowfins eat mainly other fish, frogs, and crayfish.

(R4) Adult frogs eat mainly insects and other small animals, including
earthworms, minnows, and spiders.

(R32) Kookaburras eat caterpillars, fish, frogs, insects, small
mammals, snakes, worms, and even small birds.

Of the 32 sentences returned, only (R4) correctly answers the user
query; the other results answer a different question: “What eats
frogs?” A bag-of-words approach fundamentally cannot differentiate
between a query in which the frog is in the subject position and a
query in which the frog is in the object position. Compare this to the
results produced by our relations matcher:

What  do  frogs  eat? 

(R4) Adult frogs eat mainly insects and other small animals, including
earthworms, minnows, and spiders.


By  examining  subject-verb-object  relations,  our 
system  can  filter  out  irrelevant  results  and  return 
only  the  correct  responses. 
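The contrast between the two retrieval modes can be sketched with a handful of hand-annotated sentences. This is an illustration of the idea only, not our relations matcher: the keyword sets and subject-verb-object triples below were simplified and lemmatized by hand from the sample sentences rather than produced by a parser, and all names are hypothetical.

```python
from typing import List, NamedTuple, Optional, Set, Tuple

Relation = Tuple[str, str, str]  # (subject, verb, object), lemmatized by hand


class Sentence(NamedTuple):
    label: str
    keywords: Set[str]
    relations: Set[Relation]


CORPUS: List[Sentence] = [
    Sentence("R1", {"alligator", "eat", "frog", "fish", "snake"},
             {("alligator", "eat", "frog"), ("alligator", "eat", "fish")}),
    Sentence("R2", {"bat", "eat", "frog", "lizard", "bird"},
             {("bat", "eat", "frog"), ("bat", "eat", "lizard")}),
    Sentence("R3", {"bowfin", "eat", "frog", "fish", "crayfish"},
             {("bowfin", "eat", "frog"), ("bowfin", "eat", "fish")}),
    Sentence("R4", {"frog", "eat", "insect", "earthworm", "spider"},
             {("frog", "eat", "insect"), ("frog", "eat", "earthworm")}),
    Sentence("R32", {"kookaburra", "eat", "frog", "caterpillar"},
             {("kookaburra", "eat", "frog")}),
]


def keyword_search(corpus: List[Sentence], terms: Set[str]) -> List[str]:
    """Bag-of-words retrieval: every query term must appear in the sentence."""
    return [s.label for s in corpus if terms <= s.keywords]


def relation_search(corpus: List[Sentence], subject: Optional[str],
                    verb: str, obj: Optional[str]) -> List[str]:
    """Relation retrieval: None acts as the wildcard being asked about."""
    def matches(rel: Relation) -> bool:
        s, v, o = rel
        return v == verb and subject in (None, s) and obj in (None, o)
    return [sent.label for sent in corpus
            if any(matches(r) for r in sent.relations)]


# "What do frogs eat?"
print(keyword_search(CORPUS, {"frog", "eat"}))
print(relation_search(CORPUS, "frog", "eat", None))
```

Swapping the wildcard to the subject slot, `relation_search(CORPUS, None, "eat", "frog")`, answers the other question, "What eats frogs?", which the keyword query cannot distinguish from the first.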

We are currently working on combining this relations-indexing
technology with the automatic paraphrase generation technology
described earlier. For example, our approach would be capable of
automatically learning a paraphrase like X eat Y ⇔ Y is a prey of X. A
large collection of such paraphrases would go a long way in overcoming
the brittleness associated with a relations-based indexing scheme.

7  Contributions 

We have presented a method for automatically learning structural
paraphrases from aligned monolingual corpora that overcomes the
limitations of previous approaches. In addition, we have sketched how
this technology can be applied to enhance the performance of a question
answering system based on indexing relations. Although we have not
completed a task-based evaluation, we believe that the ability to
handle variations in language is key to building better question
answering systems.